AITopics | wu & yang

How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion

Daras, Giannis, Cherapanamjeri, Yeshwanth, Daskalakis, Constantinos

arXiv.org Artificial Intelligence, Nov-4-2024

The quality of generative models depends on the quality of the data they are trained on. Creating large-scale, high-quality datasets is often expensive and sometimes impossible, e.g. in certain scientific applications where there is no access to clean data due to physical or instrumentation constraints. Ambient Diffusion and related frameworks train diffusion models with solely corrupted data (which are usually cheaper to acquire), but ambient models significantly underperform models trained on clean data. We study this phenomenon at scale by training more than 80 models on data with different corruption levels across three datasets ranging from 30,000 to approximately 1.3M samples. We show that it is impossible, at these sample sizes, to match the performance of models trained on clean data when only training on noisy data. Yet, a combination of a small set of clean data (e.g., 10% of the total dataset) and a large set of highly noisy data suffices to reach the performance of models trained solely on similar-size datasets of clean data, and in particular to achieve near state-of-the-art performance. We provide theoretical evidence for our findings by developing novel sample complexity bounds for learning from Gaussian Mixtures with heterogeneous variances. Our theoretical model suggests that, for large enough datasets, the effective marginal utility of a noisy sample is exponentially worse than that of a clean sample. Providing a small set of clean samples can significantly reduce the sample size requirements for noisy data, as we also observe in our experiments.
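As a rough illustration of why a few clean samples can matter so much, the sketch below estimates a single Gaussian mean from samples with heterogeneous variances using inverse-variance weighting. This is a deliberately simpler setting than the Gaussian mixture analysis described in the abstract: here a noisy sample loses value only in proportion to its variance, whereas the abstract reports an exponential gap for mixtures. The sample counts, variances, and function names are assumptions chosen for illustration and do not come from the paper.

```python
import random

def weighted_mean(samples, variances):
    """Inverse-variance weighted estimate of a mean shared by all samples."""
    weights = [1.0 / v for v in variances]
    return sum(w * x for w, x in zip(weights, samples)) / sum(weights)

def estimator_variance(variances):
    """Variance of the inverse-variance weighted mean for independent samples."""
    return 1.0 / sum(1.0 / v for v in variances)

# Hypothetical dataset: 10% clean samples (variance 1), 90% noisy (variance 25).
n_clean, n_noisy = 100, 900
clean_var, noisy_var = 1.0, 25.0
mixed_variances = [clean_var] * n_clean + [noisy_var] * n_noisy
noisy_only_variances = [noisy_var] * (n_clean + n_noisy)

rng = random.Random(0)  # fixed seed so the run is repeatable
true_mean = 3.0
samples = [rng.gauss(true_mean, v ** 0.5) for v in mixed_variances]

estimate = weighted_mean(samples, mixed_variances)
mixed_var = estimator_variance(mixed_variances)            # 1 / (100 + 900/25)
noisy_only_var = estimator_variance(noisy_only_variances)  # 1 / (1000/25)

print(f"estimate of the mean: {estimate:.3f}")
print(f"estimator variance, 10% clean + 90% noisy: {mixed_var:.5f}")
print(f"estimator variance, 100% noisy:            {noisy_only_var:.5f}")
```

In this toy setup the 100 clean samples contribute more to the estimate than the 900 noisy ones combined, which is the qualitative effect the abstract describes, though not its quantitative scaling law.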

data quality, machine learning, natural language, (19 more...)

2411.0278

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.88)
Information Technology > Data Science > Data Quality > Data Cleaning (0.76)

FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control

von Rütte, Dimitri, Biggio, Luca, Kilcher, Yannic, Hofmann, Thomas

arXiv.org Machine Learning, Feb-1-2022

Generating music with deep neural networks has been an area of active research in recent years. While the quality of generated samples has been steadily increasing, most methods are only able to exert minimal control over the generated sequence, if any. We propose the self-supervised description-to-sequence task, which allows for fine-grained controllable generation on a global level. We do so by extracting high-level features about the target sequence and learning the conditional distribution of sequences given the corresponding high-level description in a sequence-to-sequence modelling setup. We train FIGARO (FIne-grained music Generation via Attention-based, RObust control) by applying description-to-sequence modelling to symbolic music. By combining learned high-level features with domain knowledge, which acts as a strong inductive bias, the model achieves state-of-the-art results in controllable symbolic music generation and generalizes well beyond the training distribution.
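The description-to-sequence idea can be sketched as follows: derive a coarse description from a symbolic sequence, then pair it with that same sequence as a self-supervised training example. The `Note` type, the `describe` and `make_training_pair` helpers, and the particular features (bar count, note density, mean pitch, mean duration) are hypothetical choices for illustration; the abstract does not specify the feature set, and no model training is shown.

```python
import math
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class Note:
    pitch: int       # MIDI pitch number (assumed representation)
    start: float     # onset time in beats
    duration: float  # length in beats

def describe(notes, beats_per_bar=4.0):
    """Extract a coarse, hand-crafted description from a note sequence."""
    if not notes:
        return {"n_bars": 0, "note_density": 0.0,
                "mean_pitch": None, "mean_duration": None}
    end = max(n.start + n.duration for n in notes)
    n_bars = max(1, math.ceil(end / beats_per_bar))
    return {
        "n_bars": n_bars,
        "note_density": len(notes) / n_bars,  # notes per bar
        "mean_pitch": mean(n.pitch for n in notes),
        "mean_duration": mean(n.duration for n in notes),
    }

def make_training_pair(notes):
    """Return (description, target): the description is the conditioning
    input and the original sequence is the target, so no labels are needed."""
    return describe(notes), list(notes)

melody = [Note(60, 0.0, 1.0), Note(64, 1.0, 1.0),
          Note(67, 2.0, 2.0), Note(72, 4.0, 4.0)]
description, target = make_training_pair(melody)
print(description)
```

A sequence-to-sequence model would then be trained to reconstruct `target` from `description`; at generation time, editing the description is what gives the user control.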

arxiv, figaro, sequence, (11 more...)

2201.10936

Country:

Europe > Switzerland > Zürich > Zürich (0.04)
North America > United States (0.04)
Europe > Austria > Vienna (0.04)

Genre: Research Report (0.84)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Support Estimation via Regularized and Weighted Chebyshev Approximations

Chien, I, Milenkovic, Olgica

arXiv.org Machine Learning, Jan-22-2019

We introduce a new framework for estimating the support size of an unknown distribution which improves upon known approximation-based techniques. Our main contributions include describing a rigorous new weighted Chebyshev polynomial approximation method and introducing regularization terms into the problem formulation that provably improve the performance of state-of-the-art approximation-based approaches. In particular, we present both theoretical and computer simulation results that illustrate the utility and performance improvements of our method. The theoretical analysis relies on jointly optimizing the bias and variance components of the risk, and combining new weighted minimax polynomial approximation techniques with discretized semi-infinite programming solvers. Such a setting allows for casting the estimation problem as a linear program (LP) with a small number of variables and constraints that may be solved as efficiently as the original Chebyshev approximation-based problem. The described approach also applies to the support coverage and entropy estimation problems. Our newly developed technique is tested on synthetic data and used to estimate the number of bacterial species in the human gut. On synthetic datasets, we observed up to five-fold improvements in the value of the worst-case risk. For the bioinformatics application, metagenomic data from the NIH Human Gut and the American Gut Microbiome was combined and processed to obtain lists of bacterial taxonomies. These were subsequently used to compute the bacterial species histograms and estimate the number of bacterial species in the human gut at roughly 2350, with the species being represented by trillions of cells.
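For readers unfamiliar with the building block the abstract extends, here is a minimal sketch of plain Chebyshev interpolation on an interval. It is not the weighted, regularized LP formulation the authors describe; the target function `exp(-x)`, the interval [0, 4], the degree, and the function names are assumptions picked only to show how a low-degree polynomial can approximate a smooth function with small worst-case error.

```python
import math

def chebyshev_coefficients(f, a, b, degree):
    """Coefficients c_k so that sum_k c_k * T_k(u) interpolates f at the
    Chebyshev nodes of [a, b], where u maps [a, b] onto [-1, 1]."""
    n = degree + 1
    thetas = [math.pi * (j + 0.5) / n for j in range(n)]
    # Function values at the Chebyshev nodes, mapped from [-1, 1] to [a, b].
    values = [f(0.5 * (b - a) * math.cos(t) + 0.5 * (b + a)) for t in thetas]
    coeffs = [
        (2.0 / n) * sum(v * math.cos(k * t) for v, t in zip(values, thetas))
        for k in range(n)
    ]
    coeffs[0] /= 2.0  # the constant term carries half the weight
    return coeffs

def chebyshev_eval(coeffs, a, b, x):
    """Evaluate the Chebyshev series at x via the three-term recurrence."""
    u = (2.0 * x - a - b) / (b - a)
    t_prev, t_curr = 1.0, u  # T_0(u) and T_1(u)
    total = coeffs[0]
    for c in coeffs[1:]:
        total += c * t_curr
        t_prev, t_curr = t_curr, 2.0 * u * t_curr - t_prev
    return total

def target(x):
    return math.exp(-x)  # hypothetical smooth function to approximate

a, b, degree = 0.0, 4.0, 8
coeffs = chebyshev_coefficients(target, a, b, degree)
grid = [a + (b - a) * i / 400 for i in range(401)]
max_error = max(abs(target(x) - chebyshev_eval(coeffs, a, b, x)) for x in grid)
print(f"max error on grid with degree {degree}: {max_error:.2e}")
```

The approach in the abstract replaces this fixed interpolation with a minimax problem that also weights and regularizes the approximation error, solved as a linear program.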

estimator, support estimation, wu & yang, (11 more...)

1901.07506

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > Illinois (0.04)
Europe > Ukraine > Kharkiv Oblast > Kharkiv (0.04)

Genre: Research Report (0.64)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
